AD-R156  7S7  EFFICIENT  SVSTOLIC  RRRRVS  FOR  THE  SOLUTION  OF  TOEPLITZ 
SVSTEHS:  RN  ILLUST. .  <U>  VRLE  UNIV  NEW  HRVEN  CT  DEPT  OF 
COMPUTER  SCIENCE  J  H  OELOSHE  ET  RL.  JUN  85 
UNCLRSSIFIED  VRLEU/'DCS/RR-37e  Neeei4-82-K-0184  F/G  3/2 


Efficient  Systolic  Arrays  for  tlic  Solution  of  Toq>lits 
Systems  :  An  XUnstration  of  a  Methodology  for 
the  Constmetion  of  Systolic  Architectures  in  VLSI 

Jeac-Marc  Delosme*,  Ilse  C.F.  Ipsen^ 

Research  Report  YALEU/DCS/RR-370 
June  1985 


Abstract  Advanced  high  resolution  methods  for  real-time  signal  processing  require  fast  multipro¬ 
cessor  architectures,  realized  as  Very  Large  Scale  Integrated  systems,  for  their  implementation.  A 
class  of  parallel  architectures,  ‘systolic  arrays’,  fulfills  the  demands  of  signal  processing  and  at  the 
same  time  conforms  to  the  limitations  of  VLSI  technology. 

The  design  of  efficient  systems  of  systolic  arrays  is  an  iterative  process  in  which  the  infor¬ 
mation  gathered  from  mapping  algorithms  onto  hardware  is  influential  in  the  development  of  the 
algorithms.  Presently,  the  absence  of  mappmg  tools  makes  it  extremely  difficult  to  find  good 
systolic  implementations  for  many  important  problems.  Often,  one  cannot  do  better  than  im¬ 
plementing  structurally  simple  but  numerically  inferior  algorithms,  which  is  undesirable  for  most 
signal  processing  tasks. 

A  methodology  is  proposed  that  automates  the  mapping  of  recurrence  equations  to  processor 
arrays.  Two  aspects  distinguish  our  methodology  from  extant  work  :  (1)  complex  coupled  ay»- 
terns  of  recurrence  equations  cw  be  systematically  treated  and  (2)  the  resulting  systolic  systems 
are  optimal.  The  methodology  takes  as  input  recurrence  equations  describing  the  algorithm,  along 
with  certain  desirable  hardware  features.  A  sophisticated  optimization  procedure  is  then  applied  to 
generate  the  description  of  optimal  arrays^hat  implement  the  algorithm.  If  there  is  no  sati^actory 
implementation,  the  algorithm  will  have  t\be  revised,  in  order  to  improve  pipelinability,  without 
sacrificing  numerical  stability.  Several  class^  of  applications  will  significantly  benefit  from  this  ap¬ 
proach  :  fast  analysis  of  parallel  algorithms,  (Sclent  partitioning  of  algorithms,  and  determination 
of  optimal  architectures  for  the  implementation  of  multiple  algorithms. 
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1.  Introduction 

Systems  for  such  applications  as  radar  and  sonar  detection  and  identification,  robot  vision, 
medical  imaging  and  geophysical  exploration  have  to  extract  features  from  or  make  decisions  based 
on  large  amounts  of  data.  Such  systems  can  typically  be  decomposed  into  two  parts,  a  front-end  and 
a  classification  subsystem.  The  front-end  collects  the  data  and  performs  filtering  and  parameter 
estimation  to  extract  useful  information  from  the  measurements;  the  classification  sub-system  infers 
from  the  information  forwarded  by  the  front-end  the  occurrence  and  nature  of  certain  events. 

In  order  to  minimize  the  decision  time  or  to  reduce  storage  by  orders  of  magnitude,  data  often 
must  be  processed  in  real  time.  In  order  to  keep  up  with  the  acquisition  rate  of  the  uncompressed 
data,  the  front-end  has  to  be  able  to  perform  several  hundred  million  operations  per  second,  and 
therefore  has  to  consist  of  many  processors.  The  cost  of  such  systems  is  minimized  when  the  desired 
throughput  is  attained  on  the  smallest  number  of  processors.  This  is  accomplished  by  implementing 
Tast’  pipelinable  algorithms  which  allow  a  minimal  operation  count  as  well  as  chaining  of  operations 
in  a  high  efficiency  pipeline. 

The  processors  themselves  should  be  fast  enough  and  hence  must  be  special  purpose  arithmetic 
computers.  For  reasons  of  reliability,  cost,  size,  weight  and  power  consumption,  they  are  presently 
realized  as  VLSI  chips.  Because  of  their  impact  on  the  whole  system,  some  limitations  of  the  VLSI 
technology  must  be  taken  into  account  at  the  outset  of  the  design  process.  The  most  important 
axe  the  high  design  cost  for  developing  novel  processor  architectures  and  the  limited  Input/Output 
capabilities  of  chips  due  to  small  pin  counts  (on  the  order  of  one  hundred)  [59].  These  constraints 
combined  with  the  desire  for  efficient  pipelining  have  led  to  the  concept  of  ‘systolic  arrays’  [36,  42). 

Systolic  arrays  consist  of  a  few  types  of  processors  that  perform  elementary  operations  (such 
as  additions  or  multiplications)  selected  by  a  simple  control  unit.  The  pattern  of  interconnections 
among  processors  is  regular  and  each  processor  is  connected  to  only  a  few  other  processors.  Since 
the  types  of  operations  performed  arc  data  independent,  the  processors  can  be  made  to  operate 
synchronously  without  affecting  the  global  performance;  this  simplifies  the  control  of  the  processors 
and  reduces  the  area  needed  to  implement  them  in  silicon. 

The  process  of  front-end  design  can  be  decomposed  into  three  steps  : 

•  Development  of  signal  processing  techniques.  Modern  techniques  optimize  a  given  criterion 
(for  instance  to  minimize  some  error  norm)  and  exploit  prior  information  (often  through  the 
choice  of  a  particular  statistical  model  for  the  incoming  signals). 

•  Development  of  fast,  stable  and  pipelinable  algorithms  that  perform  the  optimization.  It  will 
often  be  necessary  to  increase  the  number  of  operations  (by  introducing  redundancy  for  exam¬ 
ple)  in  order  to  increase  stability  (permitting  simpler  processors)  and  pipelinability  (resulting 
in  higher  efficiency). 

•  Mapping  of  the  algorithms  onto  minimal  cost  systolic  architectures  that  deliver  the  desired 
throughput  and  comply  with  the  current  technological  constraints.  The  best  architecture  may 
be  a  combination  of  several  systolic  arrays  with  different  processors  or  interconnection  pat¬ 
terns. 

Clearly  the  three  design  steps  are  not  completely  independent  and  the  best  designs  will  be 
obtained  through  an  iterative  procedure;  if  there  is  no  satisfactory  implementation  of  an  algorithm 
it  will  have  to  be  revised.  To  make  such  iterations  practically  feasible  the  time  consuming  mapping 
process  must  be  automated  and  minimal  cost  mappings  must  be  determined  at  every  iteration. 

Recently  there  have  been  mamy  attempts  to  formalize  the  usual  trial-and-error  procedure  for 
the  mapping  of  computations  onto  an  array.  (These  are  surveyed  in  the  Appendix.)  Generally 
these  mappings  are  unduly  restricted  and,  at  best,  the  designs  are  enumerated;  the  corresponding 
techniques  do  not  guarantee  the  determination  of  a  minimal  cost  design.  However,  we  believe 
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that  it  is  possible  to  find  optimal  mappings  within  a  reasonable  amount  of  time,  even  for  complex 
algorithms  that  are  mapped  onto  a  collection  of  systolic  arrays  or  for  algorithms  that  require  several 
passes  on  the  same  array. 

The  class  of  algorithms  implemented  on  front-ends  is  large.  If  the  optimization  criterion 
consists  of  minimizing  summed  squared  errors,  the  algorithms  call  either  for  matrix  inversions  and 
factorizations,  or  for  matrix  eigenvalue  or  singular  value  decompositions,  depending  on  the  model 
for  the  signals  {69].  The  processing  techniques  may  also  require  the  evaluation  of  correlations  or 
convolutions.  All  these  algorithms  can  be  expressed  as  systems  of  recurrence  equations.  Due  to 
the  wide  variety  of  these  systems  the  goal  of  the  proposed  mapping  method  is  to  find  the  optimal 
mappings  for  any  system  of  recurrence  equations. 

2.  Motivation  for  a  Design  Methodology 

Up  to  now  most  of  the  time  and  effort  in  systolic  array  design  has  gone  into  finding  an  array 
that  performs  the  desired  computation  (the  terminology  ‘systolic  array'  should  be  understood  in 
the  wider  sense  as  including  collections  of  systolic  arrays) .  If  an  algorithm  had  what  seemed  to  be 
‘too  complex'  a  structure  to  be  mapped  on  an  array,  an  ‘easier'  method  that  appeared  to  have  a 
better  chance  of  being  implementable  had  to  be  chosen.  Consequently,  the  numerical  performance 
of  the  algorithm  had  to  be  sacrificed  in  order  to  obtain  a  processor  array,  frequently  one  which 
would  execute  a  numerically  inferior,  possibly  even  unstable,  method.  For  most  signal  processing 
tasks  this  kind  of  inaccurate  and  unreliable  information  is  unacceptable.  Increasing  the  precision 
(the  number  of  bits  used  to  represent  a  numerical  value)  is  usually  no  remedy  :  it  slows  down 
the  computation,  increases  the  area  of  each  processing  element,  may  also  render  the  computation 
Input/Output  bound,  and,  after  ail,  still  does  not  lead  to  satisfactory  numerical  results. 

Application  of  our  design  methodology  will  allow  the  designer  of  systolic  systems  to  concentrate 
on  the  algorithm  development,  particularly  on  modifications  for  improved  numerical  behavior  or 
pipelinability,  rather  than  on  the  mapping  process.  Besides  reducing  the  design  time,  this  will 
give  the  designer  confidence  in  the  hardware,  since  the  resulting  systolic  array  will  be  correct  by 
construction.  Consequently,  no  a  posteriori  verification  process,  as  described  in  [10,  12,  38,  39,  51, 
52,  53,  70],  will  be  necessary. 

Automatic  design  of  systolic  arrays  has  been  addressed  from  different  points  of  view  :  algebraic 
manipulations  with  a  time  delay  operator  [13,  26,  27,  28,  34,  37,  71,  72],  iterative  improvement 
of  a  given  design  by  retiming,  [33,  40,  41,  42,  44,  45,  46],  inductive  construction  from  first  order 
recurrence  equations  [ll],  and  determination  of  linear  transformations  applied  to  a  geometric  rep¬ 
resentation  of  the  recurrence  equations  [2,  6,  7,  8,  9,  17,  19,  18,  30,  31,  32,  49,  50,  54,  55,  56,  57, 
58,  66,  67,  68] .  This  latter  approach  determines  systolic  designs  as  follows.  Taking  advantage  of 
the  regularity  in  the  recursions,  the  partial  order  of  the  operations  in  the  algorithm  is  summed 
up  through  ‘dependence  vectors'.  There  are  many  ways  of  evaluating  recursions  consistent  with 
the  partial  order.  The  ones  which  are  systematic  and  regular  can  be  represented  by  projecting 
on  a  single  ‘time  direction’,  i.e.,  there  is  a  global  time  arrow.  Constraints  on  the  possible  time 
directions  are  easily  deduced  from  the  dependence  vectors.  A  similar  notion  of  ‘processor  direction' 
allows  the  projection  of  operations  onto  processors  in  a  regular  fashion.  The  only  constraint  on 
the  processor  directions  is  linear  independence  with  the  time  direction  (which  means  that  no  two 
computations  can  be  performed  on  the  same  processor  at  the  same  time).  By  expanding  these 
ideas,  our  methodology  is  the  only  one  [18]  capable  of  systematically  handling  coupled  systems  of 
recurrence  equations  as  well  as  being  able  to  generate  provably  optimal  systolic  arrays. 

Optimal  designs  can  be  found  for  various  user-specified  constraints  on  the  architecture;  these 
may  include  restriction  of  Input/Output  to  boundary  processors,  specified  ordering  and  timing  of 
input  data,  avoidance  of  broadcasting  or  a  limit  on  the  number  of  processors.  Optimality  of  the 
mapping  is  critical,  since  an  inaccurate  assessment  of  the  parallelism  inherent  in  an  algorithm  can 
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lead  the  designer  to  make  unnecessary  modifications  to  the  algorithm.  An  array  is  defined  to  be 
optimal  if,  among  all  designs  satisfying  some  user-specified  hardware  constraints,  it  best  meets  a 
hierarchy  of  selected  criteria.  A  hierarchy  of  objectives  to  be  satisfied  in  sequence  is  introduced 
to  precisely  delineate  the  implementation  since  there  arc  generally  many  designs  that  attain  the 
optimum  for  a  single  objective.  In  a  real-time  implementation,  for  example,  given  throughput  is 
a  constraint  and  a  minimal  number  of  processors  would  be  a  primary  objective,  while  minimal 
latency  and  storage  could  be  secondary  and  tertiarj’  objectives.  The  use  of  such  a  cascade  of 
objectives  permits  a  precise  formulation  of  the  designer's  desiderata,  avoids  long  and  overwhelming 
enumerations  of  possible  solutions,  and  reduces  the  computation  time  needed  to  find  the  final 
design.  Still,  in  most  cases  these  multiple  objectives  amd  constraints  will  not  uniquely  specify  an 
optimal  design,  and  application  of  the  methodology  will  result  in  a  succinct  description  of  a  class  of 
systolic  arrays.  The  information  about  the  members  of  the  class  can  then  be  employed  for  further 
refinement  of  the  details  and  ‘fine  tuning'  of  the  systolic  array  specifications,  such  as  number  of 
registers  per  processors,  or  complexity  of  control. 

3.  Applications  of  the  Design  Methodology 

An  important  and  direct  application  of  the  methodology  is  the  determination  of  architectures 
on  which  to  implement  multiple  algorithms.  Three  more  classes  of  applications  for  which  the  design 
methodology  can  be  most  beneficial  are  illustrated  in  this  section  ;  improvement  of  algorithm  design 
and  analysis,  application  to  complex  problems,  and  problem  partitioning. 

3.1.  Improvement  of  Algorithm  Design  and  Analysis 

Since  the  methodology  will  automate  the  mapping  to  hardware,  more  time  and  effort  can 
be  spent  on  the  design  and  analysis  of  numerical  algorithms.  An  example  is  the  solution  of  linear 
systems  of  equations  with  symmetric  positive  definite  Toeplitz  coefficient  matrix.  An  algorithm  due 
to  Levinson  [47]  exploits  the  Toeplitz  structure  and  is  routinely  used  on  single-processor  machines. 
Although  it  consists  of  only  two  simple  coupled  recursions,  one  to  update  pairtial  solutions  to  the 
system  and  the  other  to  compute  certain  inner  products,  the  algorithm  does  not  readily  lend  itself 
to  a  systolic  array  implementation. 

The  methodology  first  establishes  that  the  algorithm  cannot  admit  to  an  efficient  systolic  array, 
because  the  same  vector  is  used  from  both  ends  in  the  recursion  for  the  partial  solution  update, 
thus  leading  to  highly  restrictive  dependences.  This  difficulty  b  resolved  by  working  with  two 
copies  of  the  vector  instead  of  just  one  [29].  Thb  repetition  duplicates  certain  computations  and 
increases  the  total  operation  count  by  fifty  percent,  but  eliminates  some  of  the  more  constraining 
dependences.  However,  the  methodology  shows  that  the  modified  algorithm  still  does  not  admit  to 
an  efficient  systolic  implementation  because  of  the  nature  of  the  interdependence  between  partial 
solution  update  and  inner  product  recursions. 

Further  analysb  indicates  a  way  of  drastically  improving  the  pipelinability  of  the  Levinson 
algorithm.  Thb  modification  of  the  algorithm  takes  advantage  of  the  Toeplitz  structure  when 
evaluating  the  inner  products  (and  increases  the  Levinson  operation  count  by  another  fifty  percent). 
At  thb  point,  one  notices  that  the  twice  modified  Levinson  algorithm  resembles  Schur’s  algorithm, 
except  for  some  extraneous  computations.  Thb  b  the  reason  why  Schur’s  method  b  used  instead 
of  Levinson's,  see  [1,  4,  5,  14,  35,  62]. 

Schur's  algorithm  b  a  fast  way  of  performing  the  factorization  of  the  Toeplitz  coefficient  matrix. 
To  solve  a  system  of  equations  two  more  steps  are  necessary,  a  forward  and  a  backward  substitution. 
In  1 15]  recursions  are  derived  for  these  two  steps  that  closely  resemble  the  Schur  algorithm  and 
minimize  the  number  of  intermediate  quantities  to  be  stored.  Numerical  experiments  show  that 
the  stability  and  round-off  errors  are  competitive  with  those  of  the  best,  currently  available  methods. 
In  [16]  the  methodology  b  applied  to  find  efficient  systolic  arrays  for  that  algorithm  under  various 
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user-specified  hardware  constraints.  This  example  is  presented  in  the  next  section  to  illustrate  our 
methodology. 

3.2.  Application  to  Complex  Problems 

Recent  years  have  witnessed  the  development  of  a  family  of  multichannel  exact  least-squares 
adaptive  algorithms  [20,  21,  63]  and  their  application  to  important  problems  such  as  the  estimation 
of  time-differences  of  arrival  for  target  tracking,  nobe  cancelling  or  adaptive  equalization.  These 
recursive  algorithms  exhibit  excellent  performance  and  are  ideally  suited  for  real-time  processing. 
However,  their  implementation  is  not  an  easy  task.  The  difficulty  stems  from  the  fact  that  they  are 
expressed  in  terms  of  matrix  operations  (the  order  of  the  matrices  being  the  number  of  channels)  and 
that  pipelinability  requires  the  proper  choice  of  algorithms  for  decomposing  the  matrix  operations 
into  scalar  operations.  This  implementation  problem  displays  a  tight  coupling  between  algorithm 
design  and  mapping  of  steps.  Our  methodology  will  allow  fast  evaluation  of  the  maximal  efficiency 
inherent  in  the  algorithms  and  will  guide  us  in  exploring  alternative  decompositions  of  matrix 
operations. 

3.3.  Problem  Partitioning 

Most  algorithms  A  are  formulated  in  terms  of  a  small  number  of  parameters  on  which  their 
computation  time  and  hardware  requirements  depend.  For  instance,  the  Gaussian  elimination 
process  on  a  square  matrix  of  size  n  takes  time  on  a  single  processor,  or  can  be  implemented  in 
time  n  on  a  systolic  processor  grid  of  size  proportional  to  n*  [3]. 

Of  interest  here  are  those  problems  A{n)  whose  resources  depend  on  the  maximal  number  n 
of  iterations  in  a  system  of  recurrence  equations  (for  matrix  problems  thb  would  be  equivalent 
to  dealing  with  a  matrix  of  order  no  larger  than  n).  Partitioning  the  problem  A(n)  then  means 
to  break  its  equations  down  and  possibly  reorder  them,  so  that  the  resulting  problem  A  can  be 
partitioned  into  p  independent  and  similar  subproblems  A{i)  which  can  then  be  solved  p  times 
faster.  Partitioning  b  used,  for  example,  when  the  number  of  processors  p  b  strictly  smaller  than 
the  dimension  n  of  the  problem  or  not  all  of  the  input  parameters  can  be  input  at  the  same  time; 
it  b  abo  of  fundamental  interest  when  the  hardware  b  used  for  problems  of  differing  sbes. 

For  example,  in  [25]  Gaussian  elimination  b  discussed  when  each  of  the  p  <  n  processors 
performs  eliminations  on  n/p  (not  necessarily  adjacent)  rows  or  columns  of  a  n  x  n  matrix.  Heller 
[23]  dbcusses  the  decomposition  of  recursive  filters  for  linear  systolic  arrays  while  Moldovan  et  al. 
[58]  sketch  a  way  of  mapping  the  unshifted  QR  algorithm  for  symmetric  matrices  onto  a  fixed  size 
rectangular  array.  In  [24]  partitioned  algorithms  and  corresponding  systolic  systems  are  derived  for 
LU-decompostion,  matrix  multiplication,  triangular  system  solution  and  matrix  inversion.  Priester 
et  al.  [64,  65]  rearrange  the  data  for  banded  matrix  vector  multiplication  to  fit  a  fixed  size  systolic 
array  of  [36].  A  general  survey  of  partitioning  methods  and  their  influence  on  systolic  processor 
design  b  dbcussed  in  [22]  for  some  basic  matrix  methods. 

On  account  of  the  dependence  vector  concept,  the  methodology  can  first  determine  the  possible 
partitionings  A{t)  compatible  with  p  processors.  Each  partitioning  A(t)  b  then  treated  like  a  set 
of  coupled  steps  by  the  optimization  algorithm,  with  the  constraint  that  the  resulting  array  may 
consbt  of  only  p  processors. 

4.  niostration  of  the  Design  Methodology 

In  order  to  illustrate  our  approach  an  example  b  given  which  shows  how  to  systematically 
derive  optimal  systolic  arrays  for  the  solution  of  Toeplitz  linear  systems.  The  algorithm  consbts 
of  sever^  steps  and  hence  b  complex  enough  to  provide  the  main  ideas  for  dealing  with  general 
systems  of  coupled  recurrence  equations. 

Our  design  methodology  consbts  of  several  tasks;  each  of  the  following  sections  b  devoted  to 
one  such  task.  Within  each  section,  the  necessary  notions  and  terminology  are  introduced  and  the 
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Figure  2:  Acyclic  Graph  Obtained  from  the  Directed  Cut- 
Sets  of  the  Dependence  Graph. 


4.1.  Description  of  the  Toeplitz  Hyperbolic  Choiesky  Solver 

The  Toeplitz  Hyperbolic  Choiesky  Solver  (THCS)  (15]  has  been  designed  as  an  efficient,  min' 
imal  storage,  algorithm  for  the  parallel  solution  of  linear  systems  of  equations. 


Ax  =  b. 


where  the  n  x  n  coefficient  matrix  A  =  {aij)  is  a  symmetric  positive  definite  Toeplitz  matrix, 
a|,_y|  s  Oij,  that  is,  its  elements  are  constant  along  each  diagonal. 

Several  highly  concurrent  systolic  architectures  for  the  solution  of  Toeplitz  systems  have  re¬ 
cently  been  presented  [4,  5,  14,  35,  62).  They  start  with  the  pipelined  factorization  of  A  via  variants 
of  an  old  algorithm  due  to  I.  Scbur  (cf.  [1,  60|)  but  differ  in  the  way  these  results  are  used  to  de¬ 
termine  the  solution  vector.  In  [15]  it  was  shown  how  to  use  the  rotation  parameters  computed  in 
the  Schur  algorithm  so  that  storage  is  minimized,  and  subsequent  steps  can  be  easily  pipelined  and 
closely  resemble  the  first  step. 

The  first  step,  given  in  the  recurrences  below,  computes  the  Choiesky  factor  U  of  A  (such 
that  A  =  U^U)  via  an  elimination  procedure  based  on  hyperbolic  rotations.  The  quantities  r,-; 
correspond  to  the  elements  of  the  Choiesky  factor  U,  pj  are  the  Schur  parameters  (hyperbolic 
tangents  of  the  rotations)  and  Sij  are  auxiliary  quantities. 
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Step  1  : 


1  <  t  <  Ti,  f,,o  =  a,_i 
si,o  =  0 

2  <  j  <  n,  l,,o  =  a,_i 

1  <  J  <  «,  Pj-l  =  j_i 

J  ^  ^  ^ij—l  ~  ~  Pj—l^iJ—1 

^ij—l  ~  ~ Pj—l^iJ—i  — 1 

i  +  1  <  t  <  n,  n;  =  ’■•-ij-i 
~  ^iJ-1 

In  essence,  the  first  step  performs  a  premultiplication  of  A  by  U~^ .  In  the  second  step  the  same 
operations  are  applied  to  the  right-hand  side  vector  b.  The  elements  of  y  =  U~^b  are  the  quantities 

yjj-i- 

Step  2  ; 

1  <  »  <  n,  Vifi  =  bi 
hfi  =  bi 

j  <n, 

j  <i  <  n,  yij-i  =  y,j_i  -  py_il,j_i 

=  -pj-lViJ-l  +  2,J-1 

J  +  1  <  «  <  n.  yij  =  ViJ-i 

During  the  last  step,  y  is  essentially  premultiplied  by  U~^.  Since  U~^  is  the  tratispose  of  the 
operations  are  the  same  hyperbolic  rotations  as  in  Steps  1  and  2  but  performed  in  reverse  order. 
The  solution  vector  x  =  U~^y  is  obtained  as  the  sum  of  /  and  g.  Note  that  of  all  the  quantities 
computed  in  Step  1  only  the  Schur  parameters  pj-i  and  the  diagonal  elements  r>j-i  of  C/,  hence 
just  0(n)  quantities,  are  used  in  later  steps. 

Step  3  : 

1  S  J  ^  tl,  =  y„_j+i,n-y/tn->+l,n-> 

9nJ-\  =  0 

n  -  y  +  2  <  i  <  n,  fij-i  =  /,j_j 
9i~lJ  —  9iJ 

n  ~  j  1  ^  ^  U,  Jij  =  JiJ—\  ~  Pn-j9ij—l 
9iJ  ~  ~ Pn-j  fij-\  "b  9iJ—l 

1  <  J  <  n,  Xi  =  -I-  y,,„ 

Associated  with  the  algorithm  is  a  dependence  graph  whose  nodes  are  the  variables  in  the 
algorithm  and  whose  edges  represent  the  dependences  between  variables,  see  Figure  1.  The  decom¬ 
position  of  an  algorithm  into  steps  is  determined  from  the  directed  cutsets  of  this  graph.  Here, 
the  v'ariables  are  decomposed  into  six  groups,  where  the  graph  of  the  dependences  between  these 
groups  is  acyclic,  see  Figure  2.  These  groups  define  six  steps  (the  functional  decomposition  into 
three  steps  used  to  present  the  example  is  a  simplification).  The  dependences  between  variables 
within  a  group  are  called  internal  dependences  while  the  ones  between  groups  are  called  external 
dependences. 
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(a)  (b) 

Figure  3:  Dependence  Graph  for  Step  1  Before  (a)  and  After 
(b)  Removal  of  Redundant  Variables. 

Renaming  occurs  when  a  variable  depends  on  exactly  one  other  variable  within  the  same  group 
(though  it  could  depend  on  other  variables  outside  the  group).  In  Step  1,  for  instance,  the  variables 
f  and  s  serN'e  no  other  purpose  but  renaming  and  introduce  unnecessary  dependences,  see  Figure  3. 
Upon  their  removal  the  equations  of  the  first  step  are  rewritten  as 

Step  1  : 

1  <  t  <  n.  5  a, 

si,-i  =  0 

2  <  t  <  n,  s,,_i  =  a,_i 
1  <  J  <  n.  Pj~\  = 

j  <  i  <  n,  r,j_i  =  r,_ij_2  -  Py-iSij-2 
SiJ-l  =  +  S|,>-2 

Similarly,  the  removal  of  redundant  variables  in  the  other  two  steps  yields  the  equations  below  : 
Step  2  : 

1  <  «  <  w,  yi.-\  =  bi 

^1— 1,— 1  =  bi 

1  <  i  <  nj  <  t  <  n,  y,j_i  =  y,j_2  -  pj-iZi-ij-2 

=  -pj-iyiJ-2  + 

Step  3  : 

1  ^  J  ^  /n— y+lj  — 1  —  yn—j+l,n—j/^n—j+l,n—j 

9n+lJ~l  =  0 

fij  =  fij  —  l  ~  Pn~j9i+lj—l 
9iJ  ~  ~Pn-jfiJ-l  +  9i+lJ-l 

1  <  t  <  n,  X,  = 

In  general,  the  decomposition  of  an  algorithm  into  steps  and  the  detection  of  redundant  variables 
is  a  two-pass  procedure  on  the  graph  of  the  dependences  in  the  algorithm. 
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j  j  3 


Figure  4:  Domains  of  Computation  for  Steps  1,  2  and  3. 


4.2.  Internal  Dependence  Vectors  and  Local  Transformations 

The  first  task  of  the  methodology,  a  concise  formalization  of  the  input,  is  accomplished  by 
expressing  the  dependences  within  each  step  mathematically.  All  variables  in  the  above  recurrence 
equations  have  at  most  two  indices.  If  u.y  is  a  variable  on  the  lefthand  side  of  a  recurrence  equation 
then  its  computation  can  be  visualized  as  taking  place  at  coordinate  (j,y)  in  two-space.  The  set  of 
points  corresponding  to  all  the  computed  variables  in  Step  k  will  be  called  its  domain  of  computation 
Dk,  k  =  1,2,3. 

In  this  case,  the  domains  of  computation  Z?*  take  the  form  of  triangles  and  are  depicted  in 
Figure  4.  Within  each  domain,  computations  occurring  at  the  same  point  {i,j)  are  performed  in 
the  same  processor  during  the  same  clock  cycle  while  computations  occurring  at  different  points 
(t,j)  will  be  executed  in  different  processors  or  different  cycles.  In  Figure  4  sides  of  the  triangles 
Dk  are  distinguished  by  labels  corresponding  to  quantities  computed  or  input  at  the  sides.  The 
recurrence  equations  in  Step  k  can  be  represented  as  a  directed,  regular  acyclic  computation  graph. 
The  nodes  of  the  graph  are  the  points  of  D*,  its  edges  are  called  internal  dependence  vectors  [55, 
56].  The  latter  relate  the  indices  of  variables  within  the  group  of  variables  associated  with  Step  k 
as  follows  :  for  each  variable  on  the  lefthand  side  of  an  equation  the  dependence  vectors  are 
the  vectors  from  point  (i^j)  toward  the  indices  of  the  quantities  needed  for  that  computation,  i.e., 
of  the  variables  on  the  right-hand  side  of  that  equation.  In  contrast  to  the  more  general  notion  of 
‘dependence’  which  is  used  to  determine  steps  and  detect  redundant  variables,  dependence  vectors 
are  derived  from  the  indices  of  the  variables.  Internal  data  dependences  define  the  valid  orders  of 
evaluating  the  recurrence  equations  within  a  step. 

A  vector  r  b  a  time  vector  for  Step  k  if  and  only  if  any  two  points  x  and  y  in  Dk  for  which 
the  computation  at  y  precedes  the  computation  at  x  satisfy  y^r  <  x^r  (here  and  in  the  sequel,  all 
vectors  are  column  vectors).  It  can  be  shown  that  a  time  vector  r  satisfies 

6^7-  <  0 

for  all  internal  dependence  vectors  6  (‘causality  constraint’).  Note  the  strict  inequality  in  the  above 
characterization.  If  6^t  =  0  for  some  6  either  broadcast  or  ‘rippling’,  an  0(n)  number  of  operations 
sequentially  applied  to  the  same  datum  in  one  clock  cycle,  would  occur.  With  the  strict  inequality 
the  clock  cycle  b  always  on  the  order  of  the  time  needed  to  perform  an  elementary  operation 
(division,  multiply-and-accumulate)  and  is  essentially  independent  of  the  number  of  processors. 

It  follows  from  the  causality  constraint  that  the  time  vectors  for  Step  k  form  a  convex  cone 
Ck  (the  origin  of  this  notion  goes  back  to  the  theory  of  multi-dimensional  systems  [48]).  The 
‘wider’  the  cone  Ck,  the  more  choices  for  the  evaluation  of  the  recurrences  in  Step  k  exist.  Thus, 
if  one  objective  b  to  minimize  computation  time,  the  freedom  in  assigning  operations  to  points  in 
Dk  should  be  exploited  in  order  to  obtain  a  time  cone  as  wide  as  possible.  Next,  assignments  of 
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operations  within  the  domains  of  computation  for  the  three  steps  of  THCS  are  derived  in  such  a 
way  that  the  associated  time  cones  are  maximal. 

It  is  important  to  observe  that  there  are  many  ways  of  writing  the  recursions  in  each  step.  For 
example,  in  Step  1  one  could  translate  the  indices  of  s  while  leaving  unchanged  the  indices  of  r, 
e.g.,  Sij-i  would  become  and  the  inner  recursion  in  Step  1  would  take  the  form 


Sij  =  -Py-in-ij-2  +  Sij-l- 

The  dependence  vectors  change  accordingly,  and  so  does  the  time  cone  for  Step  1.  Since  the  coor¬ 
dinates  of  the  points  in  the  domains  of  computations  are  integers,  the  vectors  and  transformations 
considered  now  and  in  the  sequel  are  all  integer-valued.  In  general,  if  the  recursions  contain  vari¬ 
ables  coupled  through  dependence  vectors  of  length  o(n),  one  should  only  consider  ‘small'  relative 
translations  of  o{n)  for  the  indices  of  these  variables.  If  the  relative  transformation  were  an  affine 
linear  transformation  that  either  is  a  translation  of  size  0(n)  or  is  not  a  translation  the  time  cone 
would  actually  become  smaller.  The  relative  translations  of  size  o(n)  on  the  indices  of  the  variables 
within  a  step  are  local  transformations . 

The  local  transformations  yielding  the  largest  time  cone  impose  minimal  constraints  on  the 
data  flow,  and  minimum  time  designs  are  obtained  by  assuming  that  such  a  transformation  has 
been  selected.  Since  the  boundaries  of  the  domains  of  computation  may  differ  slightly,  by  o(n),  for 
these  various  transformations,  general  area/time  results  are  expressed  up  to  additive  terms  of  size 
o(n)  (the  stated  results  hold  for  arbitrary,  sufficiently  large  n).  If  there  are  specific  constraints  on 
the  data  flow  (e.g.,  limitations  on  interprocessor  connections  such  as  a  fixed  interconnection  pattern 
for  all  the  steps)  the  local  transformations  will  be  selected  to  maximize  the  time  cones  under  these 
constraints. 

For  the  recursions  of  Step  1,  an  example  of  a  local  transformation  inducing  minimal  constraints 
is  one  which  leaves  the  indices  of  s  unchanged  (with  respect  to  r)  and  assigns  the  computation  of 
Pj-i  to  point  {j,j  —  1),  i.e.,  py_i  becomes  Pjj-i-  Alternatively,  the  computation  of  pj^i  could  be 
assigned  to  {j,j  —  2),  resulting  in  equivalent,  least  restrictive,  constraints  on  the  data  flow. 

The  dependence  vectors  for  Step  1  are  then 

(;!)•  i-i)'  Co')' 

which  can  be  summarized  in  the  matrix 

_  /-I  0  0  -1  -2  ...  1  -  nN 

~  V-1  -1  0  0  0  ...  0  )' 


The  (0  0  vector  constitutes  a  constraint  within  a  processor  ;  the  computation  of  ry j_i  requires 
pj-\  (both  quantities  can  actually  be  computed  in  parallel  with  a  non-restoring  division  algorithm). 
The  (-2  0)^...(l-n  0)^  vectors  all  have  the  same  orientation  and  direction  as  the  (  — 1  0)^ 

vector  and  do  not  add  more  constraints  to  the  optimization  problem.  Since  the  vector  (  -1  —  1  )^ 

is  inside  the  convex  cone  spanned  by  the  vectors  (  —  1  0)^  and  (0  —1)^,  the  dependence  vector 

matrix  for  Step  1  may  be  further  reduced  to 
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•  [35]  Under  considerable  restrictions,  the  design  of  matrix  multiplication  arrays  is  attempted, 
where  systolic  arrays  are  characterized  through  velocities  of  data  flows,  spatial  distributions  of 
data  and  periods  of  computation. 

•  [47]  Using  a  divide-and-conquer  approach  the  efflciency  of  those  designs  is  increased  in  which 
each  processor  operates  only  every  k-th  time  step;  the  approach  is  illustrated  on  convolution 
and  matrix  multiplication. 

•  [13]  The  definition  of  functional  operators  for  the  expression  of  vector  algorithms  makes  it  pos¬ 
sible  to  establish  a  correspondence  between  data  flow  models  of  MIMD  computation  and  design 
of  systolic  arrays.  Conditions  are  specified  under  which  a  systolic  version  of  a  computational 
graph  can  be  executed  asymptotically  as  fast  as  the  fully  concui.ent  version  of  the  original 
data  flow  graph. 
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associated  with  a  given  algorithm.  The  method  is  applied  to  the  unshifted  QR  method  for^ 
calculating  the  eigenvalues  of  a  symmetric  matrix;  one  way  of  performing  problem  partitioning 
is  sketched. 

•  [l,  44,  45]  The  algorithm  is  expressed  as  a  set  of  uniform  recurrence  equations  whose  indices 
form  a  convex  domain  in  the  cartesian  coordinate  system.  At  first  a  timing  function  is  de¬ 
termined  which  specifies  the  execution  time  of  a  single  computation  and  is  compatible  with 
the  data  dependences  of  the  algorithm.  Then,  an  allocation  function  is  found  which  maps  the 
algorithm  onto  a  processor  array.  This  appears  to  be  one  of  the  mathematically  cleanest  and 
most  powerful  approaches. 

•  [33,  34]  A  highly  mathematical  theory  is  developed  which  combines  and  enhances  the  above 
concepts  of  data  dependences  and  geometric  linear  transformations;  it  is  illustrated  for  the 
basic  linear  algebra  algorithms.  The  theory  is  extended  to  the  process  of  hardware  synthesis 
where  the  outputs  of  the  algorithm  can  be  specified  as  a  set  of  polynomials  in  some  universal, 
many-sorted  algebra  (which  does  not  constitute  a  restriction,  since  it  is  shown  that  the  output 
of  any  algorithm  performed  by  a  Turing  machine  can  be  described  by  a  set  of  polynomials). 
Non-linear  transformations  are  applied  to  increase  hardware  utilization.  Together  with  [44], 
this  is  the  most  mathematical  and  powerful  approach. 

•  [29,  30,  31,  32]  A  general  methodology  is  derived  that  eliminates  broadcasting  from  a  syn¬ 
chronous  circuit  without  changing  its  communication  structure.  Although  it  neither  describes 
nor  synthesises  systolic  arrays,  this  method  can  be  used  to  justify  some  of  transformations  that 
replace  global  broadcasting  with  local  signal  propagation. 

•  [21]  The  algorithm  is  represented  as  a  signal  flow  graph  which  b  then  converted  to  a  systolic 
array.  After  determining  the  set  of  operations  to  be  executed  by  a  single  processor,  the  time 
of  computation  of  each  operation  b  derived  by  using  the  techniques  of  the  approach  above 
(considering  cut-sets,  inserting  and  scaling  of  delays). 

•  [24]  Based  on  the  example  of  finite  and  infinite  impube  response  filters,  thb  approach  uses  an 
algebriuc  representation  to  combine  the  notion  of  z-operator  and  ^retiming’  techniques  similar 
to  the  ones  in  [29,  30,  31,  32],  leading  to  an  algebra  for  deriving  designs  whose  z-graphs  have 
systolic  properties. 

•  [27,  28]  A  program  takes  as  input  a  nested-loop  algorithm  and  outputs  a  corresponding  systolic 
design  after  having  performed  the  following  bottom-up  process  :  starting  with  the  innermost 
loops  of  the  algorithm,  the  program  allocates  processors  to  particular  operations  (with  help  from 
the  user),  schedules  their  computation  and  performs  a  small  set  of  transformations  (selected  by 
the  user)  to  improve  the  design.  The  program  relies  heavily  on  user  intervention  and  insight, 
the  sequence  of  applied  transformations  does  not  result  in  the  optimal  design,  but  constitutes 
a  ‘bag  of  tricks’  whose  usefulness  was  determined  by  experience.  It  has  been  used  to  derive 
basic  systolic  arrays  for  polynomial  evaluation,  matrix  multiplication,  dynamic  programming 
and  greatest  common  divisor  computation. 

•  [7]  First,  the  algorithm  b  cast  into  a  system  of  first-order  recurrence  equations,  by  adding  and 
substituting  variables.  Then  it  b  mapped  onto  space-time  recurrence  equations,  starting  with 
the  initial  conditions  and  proceeding  inductively  along  a  chosen  time  direction.  On  account  of 
the  induction  process,  algorithms  are  allowed  as  input  that  do  not  necessarily  lead  to  designs 
with  a  repetitive  structure  or  to  data  flows  of  uniform  speed.  However,  already  for  simple 
algorithms  the  notation  becomes  lengthy  and  awkward,  while  the  restriction  to  first-order 
recurrences  may  exclude  important  designs. 
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Appendix  :  Literature  Survey 

It  has  been  recognized  that  the  characterization  of  systolic  arrays  by  graphical  representations 
of  data  flow  is  not  sufficient  to  guarantee  their  correctness  or  explain  their  workings,  see  [11]  for  a 
recent  survey.  Attempts  at  appropriate  formalizations  can  be  essentially  divided  into  two  categories, 
verification  of  the  correctness  of  systolic  designs  and  systematic  synthesis  of  systolic  designs  (which 
are  correct  by  construction). 

Methods  aimed  at  mere  verification  include 

•  [6,  8]  Algorithms  are  expressed  as  recursive  space-time  progr2ims  and  inductive  techniques  are 
applied  for  their  verification. 

•  [36,  37,  38]  Data  appearing  at  a  processor  interconnection  at  successive  time  instances  are 
represented  as  a  data  sequence.  The  computations  performed  by  a  processor  are  modeled  as 
causal  sequence  operators,  the  latter  being  difference  equations  which  relate  the  input  data 
sequences  to  the  output  data  sequences.  The  operation  of  the  whole  array  is  verified  by  solving 
the  systems  of  difference  equations  associated  with  all  processors  and  showing  that  they  comply 
with  the  intended  I/O  description  of  the  whole  array.  However,  it  is  not  always  possible  to 
solve  the  systems  of  diflference  equations  analytically.  The  model  is  applied  to  the  verification 
of  a  pipelined  systolic  system  that  performs  the  assembly  of  Finite  Element  stiffness  matrices. 

•  [48]  An  induction  proof  b  given  for  the  correctness  of  an  array  that  performs  the  QR  decom¬ 
position  of  a  banded  matrix  [14]. 

•  [25,  26]  Based  upon  the  wavefront  concept  (see  below)  space-time-data  equations  are  used  to 
describe  the  movements  of  data  elements.  The  operation  of  the  hexagonal  matrix  multiplication 
array  [23]  is  then  shown  to  be  correct  by  basically  Simulating'  the  movement  of  all  data  elements 
through  the  whole  array.  The  approach  is  tedious  if  the  geometry  is  not  very  regular  or  the 
data  flow  patterns  are  complicated. 

In  all  of  the  above  approaches  the  notation  becomes  awkward  and  unwieldy  when  applied  to 
complex  computations.  The  following  methodologies  aim  at  the  synthesis  of  new  systolic  designs 
which,  by  construction,  are  correct  and  do  not  need  a  posteriori  verification. 

•  [9,  15,  16,  50]  The  storage  elements  in  a  systolic  system  are  described  via  the  so-called  z- 
operator,  2[a:(i)]  =  x(i  -  1),  which  models  a  time  delay  operator.  On  account  of  the  correspon¬ 
dence  between  mathematical  expressions  and  systolic  systems  (consisting  only  of  logical  and 
memory  elements)  the  mathematical  formulae  can  be  manipulated  to  yield  different  systolic 
arrays. 

•  [17,  22,  49]  The  uniform  progression  in  time  or  space  of  a  set  of  data  elements,  which  do 
not  belong  to  the  same  data  stream,  can  be  viewed  as  a  wavefront;  the  z-operator  is  used  to 
represent  the  delay  or  rotation  in  the  direction  of  wavefronts. 

•  [2,  3,  4,  5,  18,  19,  20,  39,  46]  Nested-loop  algorithms  are  mapped  onto  systolic  arrays,  whereby 
the  computation  is  modeled  as  a  ‘lattice'  whose  nodes  represent  operations  and  whose  edges 
represent  the  computational  data  dependences.  Geometric  linear  transformations  are  first  ap¬ 
plied  to  the  lattice.  The  transformed  lattice  b  then  mapped  onto  the  space-time  domain,  where 
the  coordinates  of  each  node  indicate  at  which  time  and  in  which  processor  the  computation 
b  to  be  performed. 

•  [10,  11,  12,  40,  41,  42,  43]  Extending  the  approach  of  [19,  20],  nested-loop  algorithms  are 
mapped  onto  systolic  arrays.  Linear  transformations  are  applied  to  the  matrix  representing 
the  data  dependences.  The  integer  solution  to  thb  matrix  equation  yields  a  mapping  from 
algorithm  to  architecture.  The  software  implementation  of  thb  method  enumerates  systolic 
designs  but  contains  no  systematic  w’ay  of  reducing  the  search  space  of  all  systolic  arrays 
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A  space-time  representation  for  one  possible  design  is  given  in  Figure  8. 


5.  Summary 

Our  methodology  extracts  data  dependences  from  the  recurrence  equations  to  summarize  the 
information  about  ail  possible  ways  in  which  these  equations  can  be  evaluated.  The  optimal  design 
is  selected  from  these  possibilities  by  symbolically  applying  linear  (affine)  transformations  to  the 
computation  graph  of  the  equations  and  then  determining  the  transformation  parameters  so  that 
the  resulting  array  is  optimal  with  respect  to  a  hierarchy  of  objectives.  Hardware  regularity  is 
obtained  by  exploiting  the  regularity  of  the  data  dependences,  that  is,  by  taking  advantage  of  their 
invariance  under  translation  in  a  geometric  representation  of  the  computations.  The  employed 
transformations  preserve  the  regularity  and,  in  some  cases,  might  even  be  used  to  enhance  it. 

When  minimization  of  computation  time  is  the  main  objective,  decomposition  of  the  optimiza¬ 
tion  problem  into  separate  local  and  global  optimization  problems  has  parallels  in  the  design  of 
MOS  networks  (61).  In  both  cases,  the  internal  dependences  within  each  step  (equivalent  to  areas 
of  intraconnections  within  a  cell)  can  be  optimized  with  almost  no  consideration  for  the  external 
dependences  between  steps  (equivalent  to  interconnections  outside  the  cells).  For  other  hierarchies 
of  objectives,  the  decomposition  into  separate  optimization  problems  for  local  and  global  transfor¬ 
mations  may  not  be  possible,  which  could  translate  into  a  quadratic  programming  problem.  We 
have  come  across  techniques  that  solve  these  quadratic  problems  in  an  acceptable  amount  of  time. 

Although  the  methodology  was  illustrated  by  way  of  an  example  involving  only  two-index 
variables,  it  should  be  clear  that  there  is  no  conceptual  difficulty  in  applying  it  to  recursions  on 
more  indices.  The  most  general  case  involves  several  time  and  processor  directions,  where  multiple 
time  directions  are  best  understood  as  directions  for  scanning  the  operations  and  are  mapped 
accordingly  onto  the  physical  time. 
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Processors 


Figure  8:  An  Example  of  a  Systolic  Array  with  Computa¬ 
tion  Time  5n,  n  Processors  and  Input/Output  on  the  two 
Boundary  Processors. 


Figure  7  depicts  the  space-time  representation  for  two  such  arrays.  Note,  that  enforcing  the 
non-overlap  of  computations  is  more  involved  than  just  ensuring  that  the  polygons  which  bound 
the  domains  of  computation  are  not  superimposed. 

The  one  characteristic  shared  by  all  of  the  above  designs  is  that  Input/Output,  i.e.,  commu¬ 
nication  with  the  outside  world,  takes  place  in  each  processor.  In  case  this  is  undesirable,  the  next 
step  then  consists  of  determining  the  class  of  arrays  with  the  minimal  number  of  processors  among 
those  of  shortest  computation  time  subject  to  the  constraint  that  Input/Output  may  take  place 
on  at  most  two  processors.  The  requirement  for  a  minimal  processor  count  and  the  Input/Output 
constraint  reduce  the  set  of  possible  choices  for  the  parameters  cjt,  and  vt  to 
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Solving  the  subsequent  optimization  problem  demonstrates  that  the  minimal  computation  time  on 
an  array  with  n  processors  and  Input/Output  restricted  to  the  two  boundary  processors  comes  to 
5n  -b  o(n)  and  is  attained  for  the  following  thirty-two  designs  : 
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subject  to  the  causality  constraints.  Because  the  objective  function  tj  is  the  maximum  of  two  linear 
functions  and  the  constraints  are  linear,  the  optimization  can  be  cast  as  a  standard  linear  integer 
programming  problem.  To  fully  determine  the  transformations  Tt,  the  remaining  parameters  ct, 
dk  and  Vk  are  found  subject  to  the  conditions  that  no  two  computations  from  the  same  step  are 
performed  in  the  same  processor  at  the  same  time  (‘nonsingularity  constraint’),  and  that  no  two 
computations  from  different  steps  are  performed  in  the  same  processor  at  the  same  time  (‘no-overlap 
constraint’). 

It  is  natural  now  to  ask  which  of  the  arrays,  among  all  those  with  minimal  computation 
time,  possess  the  minimal  number  of  processors.  Clearly,  the  minimal  number  of  processors  is 
n,  assuming  that  no  ‘problem  partitioning’  is  undertaken.  It  then  follows  that  the  transformed 
domains  of  computation  of  all  the  steps  must  belong  to  a  horizontal  band  of  width  n,  as  drawn  in 
Figure  6,  leading  to  a  set  of  constraints  on  the  possible  values  of  the  parameters  ct,  and  vt  : 


(1)  0  <  cin  <  n,  0  <  cin  +  din  <  n 

(2)  0  <  vj  <  0  <  cjn  -I-  U2  <  n,  0  <  c^n  +  d^n  -I-  V2  <  n 

(3)  0  <  Can  -I-  ua  <  n,  0  <  d^n  -b  ua  <  0  ^  can  -t-  dzn  -b  va  <  n. 

Solving  the  optimization  problem  (4.1)  subject  to  these  constraints  combined  with  the  causality 
constraints  for  at,  bt  and  ut,  the  condition  that  Tt  be  nonsingular  and  the  no-overlap  constraint 
results  in  thirty-two  systolic  arrays  with  minimal  computation  time  3n  and  minimal  processor 
count  n  : 
either 
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Figure  6:  Horizontal  Band  for  Arrays  with  Minimal  Processor 
Count. 

Processors 


Figure  7;  Two  Examples  of  Systolic  Arrays  with  n  Processors 
and  Computation  Time  3n. 

Further  manipulation, 

0  <  6in  <  ain  +  bin  <  can  +  han  +  ua 

<  aafi  +  U3  <  Can  +  6371  +  1x3 

<  It, 

and  the  fact  that  the  transformations  are  integer-valued  imply 

It  ^  3n. 

Thus,  the  total  computation  time  is  bounded  below  by  3n  -f-  o(n). 

There  are  in  general  many  transformations  resulting  in  designs  that  achieve  this  lower  bound 
and  hence  minimize  the  computation  time.  They  are  characterized  by  solving  for  the  parameters 
at,  bh  and  ujt,  k  =  1,2,3,  the  optimization  problem 


a/  _  f  ~0’1  -bi\  _  (-a^-bi  -62  N  A'  _  -bz\ 

-dj’  ^2-\^-C2-d2  -djj’  -d^)- 


The  domains  of  computation  are  mapped  into 


n\  _  fO  ain  ain+  ^r>\ 
n)  \0  cin  c\n  +  d\n) 


7-/0  n 
0 


n\  _  /u2  a2n  +  «2  a2n  +  ^>2«  +  «2\ 
n  y  ~  \  V2  C2n  +  V2  C2n  +  d^n  +  V2  y 

n  /  aan  +  1/3  6371  +  U3  0311  +  b^n  +  1/3 ' 

n  )  \  czn  +  t>3  dzn  +  V3  czn  +  dzn  +  vz  ^ 


where  without  loss  of  generality  ui  and  vi  were  set  to  zero.  The  extreme  external  dependence 
vectors  between  Steps  1  and  2  for  the  variable  pj-i  are  given  by  Ai_2,  between  Steps  1  and  3  for 
variables  pj-i  and  rjj-i  by  A'j.j,  and  between  Steps  2  and  3  for  the  variables  yjj-i  by  Aj.g, 
where 

_^“W2  -a2n  —  «2  ain  +  6in  —  a2n  -  62”  ~  «2N 

^2  —  C2n  —  U2  c\n  +  d\n  —  Czn  —  d^n  —  Vi  ) 

6371— «3  -azn  —  bzn-uz  oin  +  6in  -  0371  —  usX 
\-dzn-vz  -czn  —  dzn-vz  c^n  +  d^n  -  czn  ~  vz  J 


bzn  —  uz  —azn  —  6371  -  U3  oiti  +  6i7i  —  0371  —  U3\ 
dzn  —  vz  —czn  —  dzn  —  vz  c\n din  —  czn  ~  vz  ) 

—  ( <^2«  +  bin  +  U2  “  aa”  ~  «3  “2  -  bzn  -  ti3\ 

\  C271  +  din  +  V2  -  C371  —  vz  Vi  —  dzn  -  vz  )  ' 


Without  loss  of  generality  the  time  vector  in  the  transformed  domain  may  be  selected  as 
T  =  [  1  0]^.  The  causality  constraints  6'^r  <  0  for  all  transformed  dependence  vectors  6'  are  thus 
summed  up  by  requiring  the  first  components  in  the  dependence  vector  matrix 


A  — (A'l  A'i  A3  Aj_2  Aj_3  A2_3) 


to  be  negative  : 


(1)  — oi  <  0,  —61  <  0 

(2)  —02  ~  bi  <  0,  —bi  <  0 

(3)  03  —  63  <  0,  —63  <  0 

(1-2)  —U2<0,  — 0271  -  U2  <  0,  Oi7» -f  6i71  —  0271  —  6271  —  «2  <  0 

(1-3)  -6371  -  tt3  <  0,  -0371  -  6371  —  W3  <  0,  OlTl 6i71  -  0371  -  U3  <  0 

(2-3)  0271  -I-  bin  -h  U2  -  0371  -  U3  <  0,  Ui~  bzn  -  U3  <  0 

4.4.  Optimization 

The  computation  time  for  THCS  is  given  by  the  difference  of  the  times  between  the  first  and 
the  last  computation, 

It  =  max  {t}  -  min  {i}, 

(ij)€uD;'  (.jieuoi' 

where  uI7]^  is  the  union  of  the  transformed  domains  of  computation  for  Steps  1,  2,  and  3.  Systematic 
manipulations  of  the  above  inequalities  show  that  the  total  computation  time  amounts  to 


ty  =  max{63n  -|-  03,  0371  •+•  63x1  +  «3}. 


(a)  (b) 

Figure  5:  Examples  of  Dependence  Vectors,  (a)  Internal,  (b) 

External. 

for  time  optimization  purposes.  A  similar  treatment  of  the  dependence  vectors  for  Steps  2  and  3 
results  in  dependence  vector  matrices 

-.=(:; 

Local  transformations  influence  the  size  of  the  time  cone.  A  maximal  time  cone,  i.e.,  one  that 
is  as  ‘wide’  as  possible,  induces  minimal  constraints  on  the  data  flow  and  hence  results  in  designs  of 
minimal  computation  time.  Finding  the  local  transformations  that  result  in  a  maximal  time  cone 
constitutes  an  optimization  problem.  We  believe  that  for  general  systems  of  recurrence  equations, 
this  problem  can  be  viewed  as  a  particular  graph  problem  similar  to  the  one  arising  in  optimal 
placement  for  river  routing  [43]  and  hence  can  be  solved  efficiently. 

4.3.  External  Dependence  Vectors  and  Global  Transformations 

In  order  to  jointly  position  all  three  domains  D*  in  space-time  for  the  optimal  realization  of 
THCS,  separate  global  transformations  are  symbolically  applied  to  each  Dh.  The  purpose  of  these 
transformations  is  to  ‘stretch’  or  ‘skew’  the  domains  of  computation  and  pack  them  as  tightly  as 
possible,  so  that  the  computation  time  and  number  of  processors  axe  kept  to  a  minimum.  In  doing 
so,  one  must  take  into  account  both  ‘internal’  and  ‘external’  dependences,  the  latter  referring  to 
the  dependences  between  steps,  see  Figure  5.  In  order  to  compute  all  steps  along  a  certain  time 
direction  without  violating  causality,  that  time  direction  should  be  inside  the  time  cones  of  all 
steps  as  well  as  inside  the  time  cone  defined  by  the  external  dependences.  This  establishes  precise 
constraints  on  the  possible  stretchings  and  skewings  for  the  steps  as  well  as  on  their  respective 
placements,  and  therefore  on  the  parameters  of  the  global  transformations.  Thus,  the  values  of  the 
global  transformations  will  be  determined  by  an  optimization  procedure  subject  to  these  causality 
constraints. 

The  global  transformations  for  Step  k  are  defined  through 

with  akdic  —  bt-Ck  ^  0,  as  shown  below.  Nonsingularity  is  required  to  avoid  mapping  operations 
corresponding  to  different  points  in  the  domain  of  computation  onto  the  same  processor  at  the 
same  time.  The  matrices  of  extreme  internal  dependence  vectors  A*  are  transformed  to 


Ai  =  r*A*, 
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